# Import important libraries needed for functions
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(rmarkdown)
library(summarytools)
## 
## Attaching package: 'summarytools'
## 
## The following object is masked from 'package:tibble':
## 
##     view
library(dplyr)
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
## 1. Dataset Description

The dataset contains San Francisco incident reports from 2018 to the present. Each column is described below:

Incident Datetime - The date and time when the incident occurred

Incident Date - The date the incident occurred

Incident Time- The time the incident occurred

Incident Year - The year the incident occurred, provided as a convenience for filtering

Incident Day of Week - The day of week the incident occurred

Report Datetime - Distinct from Incident Datetime, Report Datetime is when the report was filed.

Row ID - A unique identifier for each row of data in the dataset

Incident ID - This is the system generated identifier for incident reports. Incident IDs and Incident Numbers both uniquely identify reports, but Incident Numbers are used when referencing cases and report documents.

Incident Number - The number issued on the report, sometimes interchangeably referred to as the Case Number. This number is used to reference cases and report documents.

CAD Number - The Computer Aided Dispatch (CAD) is the system used by the Department of Emergency Management (DEM) to dispatch officers and other public safety personnel. CAD Numbers are assigned by the DEM system and linked to relevant incident reports (Incident Number).

Report Type Code - A system code for report types, these have corresponding descriptions within the dataset.

Report Type Description - The description of the report type, can be one of: Initial; Initial Supplement; Vehicle Initial; Vehicle Supplement; Coplogic Initial; Coplogic Supplement

Filed Online - Non-emergency police reports can be filed online by members of the public using SFPD’s self-service reporting system called Coplogic. Values in this field will be “TRUE” if Coplogic was used to file the report.

Incident Code - Incident Codes are the system codes to describe a type of incident. A single incident report can have one or more incident types associated. In those cases you will see multiple rows representing a unique combination of the Incident ID and Incident Code.

Incident Category - A category mapped on to the Incident Code used in statistics and reporting. Mappings provided by the Crime Analysis Unit of the Police Department.

Incident Subcategory - A subcategory mapped to the Incident Code that is used for statistics and reporting. Mappings are provided by the Crime Analysis Unit of the Police Department.

Incident Description - The description of the incident that corresponds with the Incident Code. These are generally self-explanatory.

Resolution - The resolution of the incident at the time of the report. Can be one of:

- Cite or Arrest Adult
- Cite or Arrest Juvenile*
- Exceptional Adult
- Exceptional Juvenile*
- Open or Active
- Unfounded

\*Juvenile information is not maintained in the dataset.

Intersection - The 2 or more street names that intersect closest to the original incident, separated by a backslash (`\`).

CNN - The unique identifier of the intersection for reference back to other related basemap datasets.

Police District - The Police District where the incident occurred. District boundaries can be reviewed in the link below.

Analysis Neighborhood - This field is used to identify the neighborhood where each incident occurs. Neighborhoods and boundaries are defined by the Department of Public Health and the Mayor’s Office of Housing and Community Development.

Supervisor District - There are 11 members elected to the Board of Supervisors in San Francisco, each representing a geographic district. The Board of Supervisors is the legislative body for San Francisco. The districts are numbered 1 through 11.

Latitude - The latitude coordinate in WGS84, spatial reference is EPSG:4326

Longitude - The longitude coordinate in WGS84, spatial reference is EPSG:4326

Point - Geolocation in OGC WKT format (e.g., POINT(37.4, -122.3))
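As a side note, a Point string in this WKT form can be split back into numeric coordinates. A minimal sketch in base R, assuming a `POINT (x y)` layout like the values that appear in this dataset (the sample string and regex below are illustrative, not taken from the data):

```r
# Minimal sketch: parse a WKT POINT string into numeric coordinates.
# Assumes the "POINT (x y)" layout; the sample string is illustrative.
pt <- "POINT (-122.4073 37.7871)"

m <- regmatches(pt, regexec("POINT \\((-?[0-9.]+) (-?[0-9.]+)\\)", pt))[[1]]
lon <- as.numeric(m[2])  # first WKT coordinate is x (longitude)
lat <- as.numeric(m[3])  # second WKT coordinate is y (latitude)
```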

As we can see, this dataset is large, with many fields. To analyze and draw inferences from the data, we will need to perform some cleaning and tidying, and identify which of these fields will be useful to us.

### Dataset

Reading in the data

# Read data from the csv file
data <- read.csv("_data/SF_Incident_Reports__2018_to_Present.csv")

The dimensions of the data

dim(data)
## [1] 669605     34

The datatype of each column

str(data)
## 'data.frame':    669605 obs. of  34 variables:
##  $ Incident.Datetime                                   : chr  "2021/07/25 12:00:00 AM" "2022/06/28 11:58:00 PM" "2022/03/11 10:30:00 AM" "2021/05/15 05:47:00 PM" ...
##  $ Incident.Date                                       : chr  "2021/07/25" "2022/06/28" "2022/03/11" "2021/05/15" ...
##  $ Incident.Time                                       : chr  "00:00" "23:58" "10:30" "17:47" ...
##  $ Incident.Year                                       : int  2021 2022 2022 2021 2022 2021 2022 2022 2022 2018 ...
##  $ Incident.Day.of.Week                                : chr  "Sunday" "Tuesday" "Friday" "Saturday" ...
##  $ Report.Datetime                                     : chr  "2021/07/25 01:41:00 PM" "2022/06/28 11:58:00 PM" "2022/03/11 08:03:00 PM" "2021/05/15 05:47:00 PM" ...
##  $ Row.ID                                              : num  1.06e+11 1.17e+11 1.13e+11 1.03e+11 1.17e+11 ...
##  $ Incident.ID                                         : int  1057189 1165543 1130480 1030518 1165351 1094352 1165462 1182772 1147129 669184 ...
##  $ Incident.Number                                     : int  216105573 220264913 226040232 210183345 220361741 216178097 226109026 226147327 210618041 186110767 ...
##  $ CAD.Number                                          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Report.Type.Code                                    : chr  "II" "VS" "II" "VS" ...
##  $ Report.Type.Description                             : chr  "Coplogic Initial" "Vehicle Supplement" "Coplogic Initial" "Vehicle Supplement" ...
##  $ Filed.Online                                        : chr  "true" "" "true" "" ...
##  $ Incident.Code                                       : int  6372 71012 71000 7043 7041 28150 6244 6374 7041 71000 ...
##  $ Incident.Category                                   : chr  "Larceny Theft" "Other Offenses" "Lost Property" "Recovered Vehicle" ...
##  $ Incident.Subcategory                                : chr  "Larceny Theft - Other" "Other Offenses" "Lost Property" "Recovered Vehicle" ...
##  $ Incident.Description                                : chr  "Theft, Other Property, $50-$200" "License Plate, Recovered" "Lost Property" "Vehicle, Recovered, Motorcycle" ...
##  $ Resolution                                          : chr  "Open or Active" "Open or Active" "Open or Active" "Open or Active" ...
##  $ Intersection                                        : chr  "" "" "" "" ...
##  $ CNN                                                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Police.District                                     : chr  "Southern" "Out of SF" "Central" "Out of SF" ...
##  $ Analysis.Neighborhood                               : chr  "" "" "" "" ...
##  $ Supervisor.District                                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Latitude                                            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Longitude                                           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Point                                               : chr  "" "" "" "" ...
##  $ Neighborhoods                                       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ ESNCAG...Boundary.File                              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Central.Market.Tenderloin.Boundary.Polygon...Updated: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Civic.Center.Harm.Reduction.Project.Boundary        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HSOC.Zones.as.of.2018.06.05                         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Invest.In.Neighborhoods..IIN..Areas                 : logi  NA NA NA NA NA NA ...
##  $ Current.Supervisor.Districts                        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Current.Police.Districts                            : int  NA NA NA NA NA NA NA NA NA NA ...

Now let us check how many null values are present in the dataset.

sum(is.na(data))
## [1] 3441408

Let us check which columns have how many missing values.

missing_vals <- colSums(is.na(data))
missing_vals[missing_vals > 0]
##                                           CAD.Number 
##                                               149698 
##                                                  CNN 
##                                                35458 
##                                  Supervisor.District 
##                                                35458 
##                                             Latitude 
##                                                35458 
##                                            Longitude 
##                                                35458 
##                                        Neighborhoods 
##                                                48996 
##                               ESNCAG...Boundary.File 
##                                               662313 
## Central.Market.Tenderloin.Boundary.Polygon...Updated 
##                                               583849 
##         Civic.Center.Harm.Reduction.Project.Boundary 
##                                               584097 
##                          HSOC.Zones.as.of.2018.06.05 
##                                               529243 
##                  Invest.In.Neighborhoods..IIN..Areas 
##                                               669605 
##                         Current.Supervisor.Districts 
##                                                35571 
##                             Current.Police.Districts 
##                                                36204

As we can see, some columns have more than 500,000 null values. We can handle those when we clean the data.
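One simple way to decide which columns to drop is to look at the fraction of missing values per column. A minimal sketch on a toy data frame (the 50% cutoff is an arbitrary choice for illustration, not a rule from this dataset):

```r
# Toy data frame standing in for the real dataset
df <- data.frame(
  a = c(1, 2, NA, 4),
  b = c(NA, NA, NA, 1),
  c = c(1, 2, 3, 4)
)

# Fraction of missing values in each column
missing_prop <- colMeans(is.na(df))

# Columns that are mostly missing are candidates for dropping
mostly_missing <- names(missing_prop[missing_prop > 0.5])
mostly_missing
## [1] "b"
```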

Dataframe summary

print(summarytools::dfSummary(data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

data

Dimensions: 669605 x 34
Duplicates: 0

| Variable | Type | Stats / Top Values | Missing |
|---|---|---|---|
| Incident.Datetime | character | 315,410 distinct values | 0 (0.0%) |
| Incident.Date | character | 1,800 distinct values | 0 (0.0%) |
| Incident.Time | character | 1,440 distinct values; 00:00 most frequent (2.9%) | 0 (0.0%) |
| Incident.Year | integer | 2018: 22.8%, 2019: 22.1%, 2020: 17.7%, 2021: 19.2%, 2022: 18.2% | 0 (0.0%) |
| Incident.Day.of.Week | character | Friday highest (15.3%), Sunday lowest (13.2%) | 0 (0.0%) |
| Report.Datetime | character | 476,040 distinct values | 0 (0.0%) |
| Row.ID | numeric | 669,605 distinct values (unique per row) | 0 (0.0%) |
| Incident.ID | integer | 558,805 distinct values | 0 (0.0%) |
| Incident.Number | integer | 482,981 distinct values | 0 (0.0%) |
| CAD.Number | integer | 382,364 distinct values | 149,698 (22.4%) |
| Report.Type.Code | character | II 78.9%, IS 10.5%, VI 6.3%, VS 4.3% | 0 (0.0%) |
| Report.Type.Description | character | Initial 61.0%, Coplogic Initial 17.9%, Initial Supplement 8.2%, Vehicle Initial 6.3%, Vehicle Supplement 4.3%, Coplogic Supplement 2.3% | 0 (0.0%) |
| Filed.Online | character | empty 79.9%, true 20.1% | 0 (0.0%) |
| Incident.Code | integer | 836 distinct values | 0 (0.0%) |
| Incident.Category | character | 50 values; Larceny Theft most frequent (30.5%) | 0 (0.0%) |
| Incident.Subcategory | character | 72 values; Larceny - From Vehicle most frequent (17.3%) | 0 (0.0%) |
| Incident.Description | character | 833 values; "Theft, From Locked Vehicle" most frequent (12.7%) | 0 (0.0%) |
| Resolution | character | Open or Active 79.9%, Cite or Arrest Adult 19.3%, Unfounded 0.5%, Exceptional Adult 0.3% | 0 (0.0%) |
| Intersection | character | 6,378 intersections plus empty (5.3%) | 0 (0.0%) |
| CNN | numeric | 6,466 distinct values | 35,458 (5.3%) |
| Police.District | character | 11 values; Central highest (14.9%), Out of SF lowest (2.9%) | 0 (0.0%) |
| Analysis.Neighborhood | character | 42 neighborhoods plus empty (5.3%); Mission highest (10.1%) | 0 (0.0%) |
| Supervisor.District | integer | 11 distinct values (1-11) | 35,458 (5.3%) |
| Latitude | numeric | 6,466 distinct values (37.7-37.8) | 35,458 (5.3%) |
| Longitude | numeric | 6,466 distinct values (-122.5 to -122.4) | 35,458 (5.3%) |
| Point | character | 6,466 WKT points plus empty (5.3%) | 0 (0.0%) |
| Neighborhoods | integer | 116 distinct values (1-117) | 48,996 (7.3%) |
| ESNCAG...Boundary.File | integer | 1 distinct value | 662,313 (98.9%) |
| Central.Market.Tenderloin.Boundary.Polygon...Updated | integer | 1 distinct value | 583,849 (87.2%) |
| Civic.Center.Harm.Reduction.Project.Boundary | integer | 1 distinct value | 584,097 (87.2%) |
| HSOC.Zones.as.of.2018.06.05 | integer | 5 distinct values (1-5) | 529,243 (79.0%) |
| Invest.In.Neighborhoods..IIN..Areas | logical | all NA | 669,605 (100.0%) |
| Current.Supervisor.Districts | integer | 11 distinct values (1-11) | 35,571 (5.3%) |
| Current.Police.Districts | integer | 10 distinct values (1-10) | 36,204 (5.4%) |

Generated by summarytools 1.0.1 (R version 4.2.2)
2022-12-22

head(data)
##        Incident.Datetime Incident.Date Incident.Time Incident.Year
## 1 2021/07/25 12:00:00 AM    2021/07/25         00:00          2021
## 2 2022/06/28 11:58:00 PM    2022/06/28         23:58          2022
## 3 2022/03/11 10:30:00 AM    2022/03/11         10:30          2022
## 4 2021/05/15 05:47:00 PM    2021/05/15         17:47          2021
## 5 2022/06/28 05:22:00 PM    2022/06/28         17:22          2022
## 6 2021/11/18 01:30:00 PM    2021/11/18         13:30          2021
##   Incident.Day.of.Week        Report.Datetime       Row.ID Incident.ID
## 1               Sunday 2021/07/25 01:41:00 PM 105718906372     1057189
## 2              Tuesday 2022/06/28 11:58:00 PM 116554371012     1165543
## 3               Friday 2022/03/11 08:03:00 PM 113048071000     1130480
## 4             Saturday 2021/05/15 05:47:00 PM 103051807043     1030518
## 5              Tuesday 2022/06/28 05:22:00 PM 116535107041     1165351
## 6             Thursday 2021/11/18 04:24:00 PM 109435228150     1094352
##   Incident.Number CAD.Number Report.Type.Code Report.Type.Description
## 1       216105573         NA               II        Coplogic Initial
## 2       220264913         NA               VS      Vehicle Supplement
## 3       226040232         NA               II        Coplogic Initial
## 4       210183345         NA               VS      Vehicle Supplement
## 5       220361741         NA               VS      Vehicle Supplement
## 6       216178097         NA               II        Coplogic Initial
##   Filed.Online Incident.Code  Incident.Category  Incident.Subcategory
## 1         true          6372      Larceny Theft Larceny Theft - Other
## 2                      71012     Other Offenses        Other Offenses
## 3         true         71000      Lost Property         Lost Property
## 4                       7043  Recovered Vehicle     Recovered Vehicle
## 5                       7041  Recovered Vehicle     Recovered Vehicle
## 6         true         28150 Malicious Mischief             Vandalism
##                        Incident.Description     Resolution Intersection CNN
## 1           Theft, Other Property, $50-$200 Open or Active               NA
## 2                  License Plate, Recovered Open or Active               NA
## 3                             Lost Property Open or Active               NA
## 4            Vehicle, Recovered, Motorcycle Open or Active               NA
## 5                  Vehicle, Recovered, Auto Open or Active               NA
## 6 Malicious Mischief, Vandalism to Property Open or Active               NA
##   Police.District Analysis.Neighborhood Supervisor.District Latitude Longitude
## 1        Southern                                        NA       NA        NA
## 2       Out of SF                                        NA       NA        NA
## 3         Central                                        NA       NA        NA
## 4       Out of SF                                        NA       NA        NA
## 5       Out of SF                                        NA       NA        NA
## 6         Mission                                        NA       NA        NA
##   Point Neighborhoods ESNCAG...Boundary.File
## 1                  NA                     NA
## 2                  NA                     NA
## 3                  NA                     NA
## 4                  NA                     NA
## 5                  NA                     NA
## 6                  NA                     NA
##   Central.Market.Tenderloin.Boundary.Polygon...Updated
## 1                                                   NA
## 2                                                   NA
## 3                                                   NA
## 4                                                   NA
## 5                                                   NA
## 6                                                   NA
##   Civic.Center.Harm.Reduction.Project.Boundary HSOC.Zones.as.of.2018.06.05
## 1                                           NA                          NA
## 2                                           NA                          NA
## 3                                           NA                          NA
## 4                                           NA                          NA
## 5                                           NA                          NA
## 6                                           NA                          NA
##   Invest.In.Neighborhoods..IIN..Areas Current.Supervisor.Districts
## 1                                  NA                           NA
## 2                                  NA                           NA
## 3                                  NA                           NA
## 4                                  NA                           NA
## 5                                  NA                           NA
## 6                                  NA                           NA
##   Current.Police.Districts
## 1                       NA
## 2                       NA
## 3                       NA
## 4                       NA
## 5                       NA
## 6                       NA

### Data Cleaning

The columns below are almost entirely null and won’t be of much benefit for analysis, hence we can drop them:

- ESNCAG...Boundary.File
- Central.Market.Tenderloin.Boundary.Polygon...Updated
- Civic.Center.Harm.Reduction.Project.Boundary
- HSOC.Zones.as.of.2018.06.05
- Invest.In.Neighborhoods..IIN..Areas
- Current.Supervisor.Districts
- Current.Police.Districts

cleaned_data <- select(data, -(28:34))  # drop columns 28-34, listed above
head(cleaned_data)
##        Incident.Datetime Incident.Date Incident.Time Incident.Year
## 1 2021/07/25 12:00:00 AM    2021/07/25         00:00          2021
## 2 2022/06/28 11:58:00 PM    2022/06/28         23:58          2022
## 3 2022/03/11 10:30:00 AM    2022/03/11         10:30          2022
## 4 2021/05/15 05:47:00 PM    2021/05/15         17:47          2021
## 5 2022/06/28 05:22:00 PM    2022/06/28         17:22          2022
## 6 2021/11/18 01:30:00 PM    2021/11/18         13:30          2021
##   Incident.Day.of.Week        Report.Datetime       Row.ID Incident.ID
## 1               Sunday 2021/07/25 01:41:00 PM 105718906372     1057189
## 2              Tuesday 2022/06/28 11:58:00 PM 116554371012     1165543
## 3               Friday 2022/03/11 08:03:00 PM 113048071000     1130480
## 4             Saturday 2021/05/15 05:47:00 PM 103051807043     1030518
## 5              Tuesday 2022/06/28 05:22:00 PM 116535107041     1165351
## 6             Thursday 2021/11/18 04:24:00 PM 109435228150     1094352
##   Incident.Number CAD.Number Report.Type.Code Report.Type.Description
## 1       216105573         NA               II        Coplogic Initial
## 2       220264913         NA               VS      Vehicle Supplement
## 3       226040232         NA               II        Coplogic Initial
## 4       210183345         NA               VS      Vehicle Supplement
## 5       220361741         NA               VS      Vehicle Supplement
## 6       216178097         NA               II        Coplogic Initial
##   Filed.Online Incident.Code  Incident.Category  Incident.Subcategory
## 1         true          6372      Larceny Theft Larceny Theft - Other
## 2                      71012     Other Offenses        Other Offenses
## 3         true         71000      Lost Property         Lost Property
## 4                       7043  Recovered Vehicle     Recovered Vehicle
## 5                       7041  Recovered Vehicle     Recovered Vehicle
## 6         true         28150 Malicious Mischief             Vandalism
##                        Incident.Description     Resolution Intersection CNN
## 1           Theft, Other Property, $50-$200 Open or Active               NA
## 2                  License Plate, Recovered Open or Active               NA
## 3                             Lost Property Open or Active               NA
## 4            Vehicle, Recovered, Motorcycle Open or Active               NA
## 5                  Vehicle, Recovered, Auto Open or Active               NA
## 6 Malicious Mischief, Vandalism to Property Open or Active               NA
##   Police.District Analysis.Neighborhood Supervisor.District Latitude Longitude
## 1        Southern                                        NA       NA        NA
## 2       Out of SF                                        NA       NA        NA
## 3         Central                                        NA       NA        NA
## 4       Out of SF                                        NA       NA        NA
## 5       Out of SF                                        NA       NA        NA
## 6         Mission                                        NA       NA        NA
##   Point Neighborhoods
## 1                  NA
## 2                  NA
## 3                  NA
## 4                  NA
## 5                  NA
## 6                  NA

### Mutate data

We can split the Incident Time column into hour and minute, so we can work with the hour column in further analysis.

mutated_data <- cleaned_data %>%
  separate('Incident.Time',c('Incident.Hour','Incident.Minute'),sep = ":")


mutated_data <- mutated_data %>% 
  mutate(Incident.Month = month(Incident.Datetime, label = T))
crime <- mutated_data
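A minimal illustration of what separate() does here, on a toy data frame. Note that by default the new pieces stay character; convert = TRUE, used below for illustration only (the chunk above does not use it), turns them into integers:

```r
library(tidyr)

toy <- data.frame(Incident.Time = c("00:00", "17:47"))

# Split "HH:MM" into two columns; convert = TRUE coerces them to integer
toy <- separate(toy, Incident.Time,
                into = c("Incident.Hour", "Incident.Minute"),
                sep = ":", convert = TRUE)
toy$Incident.Hour
## [1]  0 17
```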

### Questions

After observing the data, certain questions come to mind: how crime has changed over the years, how COVID affected the crime rate, the district-wise distribution of crime, the most frequent resolution in different districts, the top crimes and the times at which they occur, and the crime hotspots.

I will now go over these questions and try to plot interesting graphs so we can get some inference from it.

## Crime over the years

# a) Distribution of crime over the years from 2018 to present

We can group by the Incident Year and plot the graph.

crime_per_year <- crime %>% 
  group_by(Incident.Year) %>% 
  summarise(total = n())
colnames(crime_per_year) <- c("Year","Total") 
library(ggplot2)
theme_set(theme_classic())

# Plot
g <- ggplot(crime_per_year, aes(Year, Total))
g + geom_bar(stat="identity", width = 0.5, fill="blue") + 
      labs(title="Crime in San Francisco", 
           subtitle="2018-2022", 
           caption=" Total crime over the years") +
      theme(axis.text.x = element_text(angle=65, vjust=0.6))

From this graph we can observe that crime dropped during the COVID years of 2020-2021, which suggests either that crime fell due to lockdown conditions or that fewer crimes were reported during those years. We can also observe crime decreasing slightly from 2018 to the present day.

# b) Distribution of crime per category

We can create a dataframe by grouping on Incident Category and calculating each category's percentage.

crime_category <- sort(table(crime$Incident.Category),decreasing = TRUE)
crime_category <- data.frame(crime_category[crime_category> 10000])
colnames(crime_category) <- c("Category", "Frequency")
crime_category$Percentage <- crime_category$Frequency / sum(crime_category$Frequency)
crime_category
##               Category Frequency Percentage
## 1        Larceny Theft    204121 0.33714656
## 2  Other Miscellaneous     47065 0.07773724
## 3   Malicious Mischief     45368 0.07493430
## 4              Assault     40680 0.06719114
## 5         Non-Criminal     40069 0.06618195
## 6             Burglary     37685 0.06224430
## 7  Motor Vehicle Theft     33794 0.05581753
## 8    Recovered Vehicle     25807 0.04262541
## 9                Fraud     21054 0.03477488
## 10       Lost Property     19705 0.03254674
## 11             Warrant     19623 0.03241130
## 12        Drug Offense     16105 0.02660062
## 13             Robbery     15000 0.02477549
## 14      Missing Person     14516 0.02397607
## 15      Suspicious Occ     13788 0.02277363
## 16  Disorderly Conduct     11057 0.01826284
library(ggplot2)
library(ggrepel)
bp<-ggplot(crime_category, aes(x=Category, y=Frequency, fill=Category)) + geom_bar(stat="identity") + 
  theme(axis.text.x=element_blank()) + geom_text_repel(data=crime_category, aes(label=Category))
bp

From the plot we can observe that the count of Larceny Theft is far greater than that of the other categories, while Disorderly Conduct is the least frequent category.

# c) Time series graph of daily crimes from 2018 to 2022

We can parse the date and group by it to get the number of crimes each day.

library(dplyr)

df_crime_daily <- crime %>%
  mutate(Date = as.Date(Incident.Date, "%Y/%m/%d")) %>%
  group_by(Date) %>%
  summarize(count = n()) %>%
  arrange(Date)
head(df_crime_daily)
## # A tibble: 6 × 2
##   Date       count
##   <date>     <int>
## 1 2018-01-01   545
## 2 2018-01-02   393
## 3 2018-01-03   429
## 4 2018-01-04   468
## 5 2018-01-05   447
## 6 2018-01-06   451
library(ggplot2)
library(scales)
plot <- ggplot(df_crime_daily, aes(x = Date, y = count)) +
  geom_line(color = "#F2CA27", size = 0.1) +
  geom_smooth(color = "#1A1A1A") +
  # fte_theme() +
  scale_x_date(breaks = date_breaks("1 year"), labels = date_format("%Y")) +
  labs(x = "Date of Crime", y = "Number of Crimes", title = "Daily Crimes in San Francisco from 2018 – 2022")
plot

We can again observe the dip during the COVID lockdown period; in general there are around 300-500 crimes per day, with occasional unusually low and high spikes.
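
To put a number on the lockdown dip rather than eyeballing the smoothed line, we could average the daily counts per year. A base-R sketch on toy data standing in for `df_crime_daily` (the counts here are illustrative, not from the dataset):

```r
# Toy daily counts spanning the COVID dip; the real df_crime_daily built
# above has one row per date with that day's crime count
df <- data.frame(
  Date  = as.Date(c("2019-06-01", "2019-06-02", "2020-06-01", "2020-06-02")),
  count = c(450, 430, 300, 320)
)

# Average daily crime count per year quantifies the dip that the
# smoothed line only shows visually
df$Year <- format(df$Date, "%Y")
yearly_mean <- tapply(df$count, df$Year, mean)
yearly_mean
```

On the full data, comparing the 2020-2021 averages against 2018-2019 would make the size of the dip explicit.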

##District-wise Crime

#a) District wise Crime Distribution

We can create a data frame by grouping by Police District.

crime_per_district <- crime %>% 
  group_by(Police.District) %>% 
  summarise(n = n())

colnames(crime_per_district) <- c("Police.District", "Total")
head(crime_per_district)
## # A tibble: 6 × 2
##   Police.District Total
##   <chr>           <int>
## 1 Bayview         59192
## 2 Central         99986
## 3 Ingleside       50612
## 4 Mission         84307
## 5 Northern        91046
## 6 Out of SF       19675
g <- ggplot(crime_per_district, aes(Police.District, Total))
g + geom_bar(stat="identity", width = 0.5, fill="pink") + 
      labs(title="District-wise crime in San Francisco", 
           subtitle="2018-2022", 
           caption=" Total crime in each district") +
      theme(axis.text.x = element_text(angle=65, vjust=0.6))

From the graph, we can see that Central has the highest number of crimes, while Out of SF has the fewest. Among the actual districts, Park has the least crime and could be considered the safest.

#b) Resolution rates in each district

g <- ggplot(crime, aes(Police.District))
g + geom_bar(aes(fill=Resolution), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title="Resolution in each District")

From this graph, we can observe that in every district the majority of cases are open or active. Tenderloin has the most arrests and Park the fewest, and each district has a small number of unfounded resolutions.
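
The stacked bars above show absolute counts, so larger districts dominate the picture. Converting counts to within-district proportions with `prop.table()` would compare resolution rates directly; a base-R sketch on a toy 2x2 table (the numbers and the two-category split are illustrative, with the full data this table would come from `table(crime$Police.District, crime$Resolution)`):

```r
# Toy Resolution-by-District counts standing in for the full table
res <- matrix(c(80, 20,
                60, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(District   = c("Park", "Tenderloin"),
                              Resolution = c("Open or Active", "Cite or Arrest")))

# prop.table(margin = 1) turns raw counts into within-district rates,
# so districts of different sizes can be compared fairly
rates <- prop.table(res, margin = 1)
rates
```

Plotting these rates with `position = "fill"` in `geom_bar()` would give the same normalised comparison visually.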

##Temporal Trends

#a) Hourly Crime Distribution

We can create a dataframe by grouping by Incident Hour.

crime_perhours <- crime %>% 
  group_by(Incident.Hour) %>% 
  summarise(TotalCrime = n())
head(crime_perhours)
## # A tibble: 6 × 2
##   Incident.Hour TotalCrime
##   <chr>              <int>
## 1 00                 36046
## 2 01                 17951
## 3 02                 15561
## 4 03                 12636
## 5 04                 10651
## 6 05                 10034
ggplot(crime_perhours, aes(x = Incident.Hour, y = TotalCrime))+
  geom_col(fill = "orange")+
  theme_minimal()+
  labs(
    title = "Crime per hour, San Francisco 2018 - 2022",
    x = "Hours",
    y = "Total Crime"
  )

We can observe that crime peaks around noon, with another spike at midnight, and then gradually decreases from 1 am to 7 am.
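
Rather than reading the peak off the bar chart, `which.max()` can extract the busiest hour programmatically; a sketch on a cut-down stand-in for `crime_perhours` (only the midnight count is from the output above, the rest are toy values):

```r
# Toy hourly totals; the real crime_perhours has 24 rows, one per hour
crime_perhours <- data.frame(
  Incident.Hour = c("00", "06", "12", "18"),
  TotalCrime    = c(36046, 11000, 42000, 35000)
)

# which.max() picks out the busiest hour programmatically instead of
# reading it off the bar chart
peak_hour <- crime_perhours$Incident.Hour[which.max(crime_perhours$TotalCrime)]
peak_hour
```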

#b) Theft Time Heatmap

Of the above hourly distribution, let us focus on Larceny Theft and plot a day-by-hour heatmap.

df_theft_time <- crime %>%
  filter(Incident.Category=="Larceny Theft")%>%
  group_by(Incident.Day.of.Week, Incident.Hour) %>%
  summarize(count = n())
head(df_theft_time)
## # A tibble: 6 × 3
## # Groups:   Incident.Day.of.Week [1]
##   Incident.Day.of.Week Incident.Hour count
##   <chr>                <chr>         <int>
## 1 Friday               00             1336
## 2 Friday               01              600
## 3 Friday               02              459
## 4 Friday               03              417
## 5 Friday               04              291
## 6 Friday               05              300
plot <- ggplot(df_theft_time, aes(x = Incident.Hour, y = Incident.Day.of.Week, fill = count)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.6),
        legend.title = element_blank(),
        legend.position = "top",
        legend.direction = "horizontal",
        legend.key.width = unit(2, "cm"),
        legend.key.height = unit(0.25, "cm"),
        legend.margin = margin(t = -0.5, unit = "cm"),
        panel.spacing = unit(0, "cm")) +
  labs(x = "Hour of Theft (Local Time)", y = "Day of Week of Theft",
       title = "Number of Thefts in San Francisco from 2018 – 2022, by Time of Theft") +
  scale_fill_gradient(low = "white", high = "orange")
plot

From this heatmap, we see a spike in thefts around midnight on weekends, and further spikes on weekday evenings from 5 pm to 8 pm.
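
Because weekdays and weekends have different overall volumes, the raw-count heatmap can hide when each day peaks. Row-normalising the day-by-hour table with `prop.table()` shows each day's hourly share instead; a base-R sketch on toy data standing in for `df_theft_time`:

```r
# Toy theft counts by day and hour; the real df_theft_time comes from
# the group_by() chunk above
df_theft_time <- data.frame(
  Incident.Day.of.Week = c("Saturday", "Saturday", "Monday", "Monday"),
  Incident.Hour        = c("00", "12", "00", "12"),
  count                = c(1500, 900, 700, 1100)
)

# Row-normalising the day-by-hour table compares the shape of each
# day's hourly profile, independent of how busy the day is overall
tab   <- xtabs(count ~ Incident.Day.of.Week + Incident.Hour, df_theft_time)
share <- prop.table(tab, margin = 1)
round(share["Saturday", "00"], 3)
```

Mapping `fill` to this share instead of the raw count would make the weekend-midnight pattern stand out even more clearly.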

##Crime hotspots

#a) Map of San Francisco

library(ggmap)

sf = get_stamenmap(bbox = c(left = -122.5164, bottom = 37.7066, right = -122.3554, top = 37.8103),
                   maptype = "toner-lite", zoom = 13)

map = ggmap(sf)
map

#b) Plotting the first 500 random points on the map

map + geom_point(data = sample_n(crime, 500), aes(x = Longitude, y = Latitude))

We can already observe a clustering towards the city center.

#c) Density Plot

map +
  stat_density2d(data = sample_frac(crime, 0.2),
                 aes(x = Longitude, y = Latitude,
                     fill = after_stat(level), alpha = after_stat(level)),
                 size = 1, bins = 50, geom = "polygon") +
  scale_fill_gradient("Crime\nDensity", low = "blue", high = "orange") +
  scale_alpha(range = c(0.2, 0.3), guide = "none") +
  guides(fill = guide_colorbar(barwidth = 1.5, barheight = 10))

We can observe higher density at certain intersections, with the maximum density around the city center. Parks and grassland regions show almost no reported crimes.

###Conclusion

Thus, with the help of the San Francisco crime dataset, we could plot these graphs and gain insight into when, where, and what kinds of crime occur in the city.